Performance Tuning of Apache Spark Framework in Big Data Processing with Respect to Block Size and Replication Factor

Authors

Abstract

Apache Spark has recently become the most popular big data analytics framework, and it ships with default configuration values. HDFS, the Hadoop Distributed File System, stores large files physically across multiple nodes in a distributed fashion. The block size determines how widely a file is distributed, while the replication factor determines how reliably it is stored: if only one copy of a given file exists and its node fails, the file becomes unreadable. Both settings are configurable per file. This paper describes the results and analysis of an experimental study that tunes these settings to minimize application execution time, as compared with the standard values. Drawing on a large number of prior studies, we employed a trial-and-error strategy to fine-tune these values. We chose two workloads, Wordcount and Terasort, for a comparative test of the framework, and used elapsed time as the evaluation metric.
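As an illustrative sketch (not taken from the paper itself), both settings can be adjusted per file through Hadoop's FileSystem API; the Scala snippet below writes a file into HDFS with an explicit block size and replication factor. The paths, the 256 MB block size, and the replication factors are hypothetical example values, not the tuned settings reported in the study.

import java.io.FileInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsTuningSketch {
  def main(args: Array[String]): Unit = {
    // Cluster-wide defaults live in hdfs-site.xml (dfs.blocksize,
    // dfs.replication); both can be overridden per file, as below.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // Hypothetical paths and tuning values, for illustration only.
    val localInput = "wordcount-input.txt"
    val hdfsTarget = new Path("/bench/wordcount-input.txt")
    val blockSize: Long = 256L * 1024 * 1024 // 256 MB per block
    val replication: Short = 2               // two copies of each block
    val bufferSize = 4096

    // Write the file into HDFS with a per-file block size and
    // replication factor instead of the cluster defaults.
    val out = fs.create(hdfsTarget, true, bufferSize, replication, blockSize)
    IOUtils.copyBytes(new FileInputStream(localInput), out, bufferSize, true)

    // The replication factor of an existing file can also be changed later.
    fs.setReplication(hdfsTarget, 3.toShort)

    fs.close()
  }
}

The same per-file overrides are available from the command line, for example hdfs dfs -D dfs.blocksize=268435456 -put wordcount-input.txt /bench/ and hdfs dfs -setrep 2 /bench/wordcount-input.txt.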


Similar Resources

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

The large amounts of data have created a need for new fram...


A comparison of teachers and supervisors, with respect to teacher efficacy and reflection

Supervisors play an undeniable role in training teachers: before teachers start their professional experience, by preparing them; in the initial years of their teaching, by checking their work within the proper framework; and later on, by assessing their progress. But surprisingly, exploring their attributes, professional demands, and qualifications has remained a neglected theme i...


Static and Dynamic Big Data Partitioning on Apache Spark

Many of today’s large datasets are organized as graphs. Due to their size, it is often infeasible to process these graphs using a single machine. Therefore, many software frameworks and tools have been proposed to process graphs on top of distributed infrastructures. This software is often bundled with generic data decomposition strategies that are not optimised for specific algorithms. In this ...


Towards Large Scale Environmental Data Processing with Apache Spark

Currently available environmental datasets are either manually constructed by professionals or automatically generated from the observations provided by sensing devices. Usually, the former are modelled and recorded with traditional general-purpose relational technologies, whereas the latter require more specific scientific array formats and tools. Declarative data processing technologies are a...


SPARQL query processing with Apache Spark

The number and the size of linked open data graphs keep growing at a fast pace and confront semantic RDF services with problems characterized as Big Data. Distributed query processing is one of them and needs to be addressed efficiently, with execution guaranteeing scalability, high availability, and fault tolerance. RDF data management systems requiring these properties are rarely built from sc...



Journal

Journal title: SAMRIDDHI: A Journal of Physical Sciences, Engineering and Technology

Year: 2022

ISSN: 2229-7111, 2454-5767

DOI: https://doi.org/10.18090/samriddhi.v14i02.4